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ABSTRACT 

Microstructures forming during ternary eutectic directional 
solidification processes have significant influence on the macro¬ 
scopic mechanical properties of metal alloys. For a realistic 
simulation, we use the well established thermodynamically 
consistent phase-field method and improve it with a new 
grand potential formulation to couple the concentration evo¬ 
lution. This extension is very compute intensive due to a 
temperature dependent diffusive concentration. We signifi¬ 
cantly extend previous simulations that have used simpler 
phase-field models or were performed on smaller domain sizes. 
The new method has been implemented within the massively 
parallel HPC framework waLBerla that is designed to exploit 
current supercomputers efficiently. We apply various opti¬ 
mization techniques, including buffering techniques, explicit 
SIMD kernel vectorization, and communication hiding. Sim¬ 
ulations utilizing up to 262,144 cores have been run on three 
different supercomputing architectures and weak scalability 
results are shown. Additionally, a hierarchical, mesh-based 
data reduction strategy is developed to keep the I/O problem 
manageable at scale. 

1. INTRODUCTION 

The microstructure in alloys significantly influences the 
macroscopic properties. In order to predict and optimize the 
resulting macroscopic material parameters, it is important to 
get a deeper insight into this microscopic structure formation. 
Especially in ternary systems with three chemical species, 
a wide range of patterns can form in the microstructure, 
depending on the physical and process parameters [25| |19[ 


|13| . In this work, we show simulation results for the pattern 
formation in ternary eutectic directional solidification of Ag- 
Al-Cu alloys. The pattern formation of this ternary eutectic 
system, during directional solidification, was studied experi¬ 
mentally by Genau, Dennstedt, and Ratke in |13| [^ [^. The 
thermodynamic data are reported in the Calphad database 
and are derived from 36 . This system is of special inter¬ 
est because many essential aspects of solidification processes 
can be studied due to the low temperature of the ternary 
eutectic point and similar phase fractions in micrographs. In 
technical applications, this alloy is used for lead free solders 
in micro-electronics 17 . However, an experimental study 
of three dimensional structure requires significant technical 
effort, e.g. with synchrotron tomography. Simulations of the 
solidification process exhibit another way of studying these 
structures, providing the whole microstructure evolution. 

Phase-field models are well established to simulate solidifi¬ 
cation problems. In 2011, Shimokawabe et al. presented 
the first peta-scale phase-field simulation on the TSUBAME 
2.0 GPU-Cluster. They studied binary dendritic growth, 
the evolution of multiple nuclei in three dimensions under 
consideration of temperature, and concentration instabilities. 
Further investigations of this binary dendritic growth were 
performed by Yamanaka et al. 


37 and Takaki et al. 31 30 


at the TSUBAME 2.0 and TSUBAME 2.5 GPU supercom¬ 
puter. In the Gordon Bell Prize 2011 awarded work, used 
a phase-field model with two phases and two components 
based on a simplified free-energy approach, neglecting the 
temperature dependence in the concentration evolution equa¬ 
tion and hence the slope of the solidus- and liquidus lines, 
continued the work of binary dendritic growth in 2D of 
by using an improved model. There, an anti-trapping 


current was added to obtain better quantitative results, but 
also leading to a more complex model. This issue of solving 
multiple non-linear partial differential equations and hence 
the increase of required memory and computational time was 
also mentioned by Yamanaka et al. [38| . 

In our work we focus on simulations of ternary eutectic 
directional solidfication with a phase-field model based on 
the grand-potential approach [^. Ternary eutectic growth 
describes the transformation of a melt in three solid phases 
at a defined ternary eutectic point. To study this ternary so- 










lidification with structures approximately two orders smaller 
than dendritic growth, four phases and three components 
have to be considered in the model. The grand potentials, 
derived from thermodynamic databases, are coupled to the 
phase-field to ensure a physical driving force with the slopes 
of the solidus and liquidus planes [^ |^ |^ . 

In our simulations, we use an extension to the well estab¬ 
lished thermodynamic consistent phase-field method, includ¬ 
ing an anti-trapping current, based on a newly formulated 
grand potential approach 16 . The phase-field and chemical 
potential formulation consist of a relatively large number of 
complicated terms and parameters, making the resulting algo¬ 
rithm significantly more compute intensive compared to other 
stencil codes such as advection-diffusion calculations. An im¬ 
plementation of the described method was already available 
in the general purpose phase-field framework PACE3D [33| , 
which provides a wide range of phase-field models coupling 
structural |28[ , fluid , and thermal effects . The 

available code was, due to its general approach, not optimized 
to study the considered phenomena in a most efficient way. 
Simulations of large three dimensional domains are required 
to capture certain physically relevant effects, which can not 
be seen in smaller domains due to strong boundary influ¬ 
ences 16 . Therefore, we re-implemented and optimized the 


specific model in waLBerla, a massively parallel framework 
for stencil-based algorithms on block-structured grids [11| . 
The framework has been shown to efficiently scale up to 1.8 
million threads and run lattice Boltzmann simulations with 


up to one trillion cells 15 


Beginning at the intra-node level by making use of SIMD 
instructions, to multithreading at intra-node level, up to 
internode parallelism via MPI, several layers of parallelism 
have to be exploited to efficiently use modern supercomputer 
architectures. We apply several optimization techniques at all 
these levels, resulting in an overall relative speedup of factor 
80 compared to the original code and reach approximately 
25% of the peak FLOP rate on the SuperMUC petascale 
system located at LRZ Munich. We show scaling results 
on three of the largest supercomputers in Germany: on 
SuperMUC, JUQUEEN, and HORNET. 

The results obtained in our simulations show excellent 
agreement with experimental two dimensional micrographs 
and three dimensional tomography |16|. 


2. MODEL 

To simulate the solidification process of ternary eutectic 
systems, a thermodynamically consistent phase-field model 
is used. The presented phase-field model of Allen-Cahn type 
consists of two coupled evolution equations, for the vector 
of order parameters cf) and the vector of chemical potentials 
/X. The equations are solved using finite difference methods 
and an explicit Euler scheme for the time discretization. 
The domain is discretized in equidistant regular cells with 
the spatial position x. To indicate the spatial discretization 
scheme of the evolution equations, we introduced the “DnCm” 
notation, as a stencil with n dimensions and total access on 
m cells. In the domain fl, the order parameters (f)a{x,t), 
a e [1,A^], describe the fractions of the N thermodynamic 
phases in each cell and each time step t. We define cf>(x,t) 
on the regular - 1 simplex A^~^,Va:,t. The evolution 


equations of the phase-fields are written as 
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The reciprocal relaxation parameter Tq, couples the phase- 
field evolution with the physical time scale and e is related to 
the interface width. The gradient energy density Vcf>) 
and the potential energy density describe the interfacial 
energy part in The driving force describes 

the thermodynamic process of the phase transition caused 
by the undercooling and connects the evolution of the order 
parameter to the chemical potential. This force is derived by 
parabolically fitted Gibbs energies which are derived from 
the thermodynamic Calphad databases [m. The evolution 
equations of the K chemical potentials i^x,t) are derived 
from Fick’s law and the variational derivation of the con¬ 
centrations c{x,t). Due to the conservation of mass, one 
concentration can be derived from the others. The num¬ 
ber of evolution equations c{x,t) as well as the number of 
components of fJ,{x, i) can therefore be reduced to A' - 1. 

The /X evolution is written as 
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The Jacobian matrix dcjdfj. describes the susceptibility and 
is derived from parabolic free energies. The two flux terms 
are a gradient flux of the chemical potential depending on the 
mobility M{cf>,T) and a flux, called anti-trapping current 
[^. It is derived as 
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with the interpolation function hi [23| . 

For the directional solidification, we use a frozen temper¬ 
ature assumption by imprinting an analytical temperature 
gradient with a defined velocity. 

For the description of the numerical optimizations, we 
distinguish different regions in the simulation domain. The 
region Be ■= {a; e Q\(j)c,{x,t) = 1 A |V<7ici| = 0}, where only 
one phase a exists, is called bulk region. In this region, the 
time derivative of the order parameter and the anti-trapping 
current 0 are zero. 

The diffuse interface /q := Q \ Ucksiv is located between 
bulk regions. The interfacial energy part and the driving force 
in 0 are exclusively calculated in Iq . The solidification front 























is defined as Fn := {a: £ 7n | (j)e{x) > 0}. The liquid region is 
defined as Lq := Be and the solid one as Sn := \ {Lq u Fq). 

For a more detailed description of the model we refer to [16[ 

in- 

2.1 Phase-field Algorithm 

The algorithm to calculate the phase-field model is de¬ 
composed into two kernels: one for updating values of the 
phase-field itself 0 and one to update the chemical potential 
©• Two lattices are allocated for each variable: two desti¬ 
nation fields denoted by and fidst and two source fields, 
denoted by (^arc and /Xarc- In the destination fields values for 
the next time step t + At are stored, while source fields hold 
values for the current time step. Each kernel performs a loop 
over the local simulation domain and updates all cell values 
of the (j> OT field. After each kernel run, the ghost layers 
are synchronized with neighboring blocks and boundaries are 
updated using Dirichlet, Neumann and periodic boundary 
conditions as shown in Figure]^ The details of this update 
scheme are depicted in Algorithm 


periodic BC moving window 
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Figure 2: Simulation setting for the directional solid¬ 
ification of a ternary eutectic system including the 
boundary conditions. Below, the analytical temper¬ 
ature gradient G with the velocity v and its direction 
is shown. 


Algorithm 1 Timestep 

1: <^dst ■<- (/>-kernel(0arc,Msrc) (see Q) 

2: (fidat ghostlayer communication 
3: (jjdat boundary handling 
4: /idst /r-kernel(/i src 0src; 0dst) (see 0 ) 

5: /idst ghostlayer communication 
6: fj-dat boundary handling 

7: Swap (f>BTC ^ (/Jdat and /isrc fJ-dat 


The 0-kernel needs the previous phase 0src and chemical 
potential fiaic values as input. Direct neighborhood values 
are required to compute gradients of the phase-field, leading 
to a D3C7 stencil for the 0-field while only local values are 
required for p (D3C1). For updating the chemical poten¬ 
tial, previous chemical potential values psrc as well as the 
phase-field values of two time steps (0src and 0dst) are nec¬ 
essary. The two phase-field timesteps and the D3C19 stencil, 
including diagonal neighbors, are required to calculate 0. 
Figure shows these data dependencies in detail for the two 
kernels. 
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Figure 1: Data dependencies of the two kernels. 

As initial setup we use solid nuclei at the bottom of a 
liquid filled domain as shown in Figure These solid nuclei 
are created by a Voronoi tesselation with respect to the given 
volume fractions of the phases. The simulation parameters of 
[I^ are used for our studies in the presented model, describing 
the system Ag-AI-Cu. 

3. IMPLEMENTATION 

In the following section, we present our implementation of 
the phase-field model and apply optimizations necessary to 
efficiently run on current supercomputing systems. 


3.1 The waLBerla Framework 

The phase-field method described above was implemented 
in the waLBerla framework (widely applicable lattice Boltz¬ 
mann from Erlangen). As the name suggests, the framework 
was initially developed for simulations using the lattice Boltz¬ 
mann method. Over time, it evolved into a multi-physics 
framework for efficient implementations of stencil-based algo¬ 
rithms on block structured grids in a massively parallel way. 
waLBerla partitions the simulation domain into equally 
sized chunks, called blocks. On each block, a regular grid is 
allocated, extended by one or more ghost layers for commu¬ 
nication in MPI parallel simulations. This concept provides 
the flexibility for supporting complex geometries and 
optional mesh refinement. The regular structure inside the 
blocks allows for highly efficient compute kernels. The data 
structure storing the blocks is fully distributed: Every pro¬ 
cess holds information only about local and adjacent blocks. 
Thus, the memory usage of a process does not depend on the 
total size of the simulation but only on the size of local and 
neighboring blocks. Only during the startup phase which is 
responsible for block setup and load balancing, one process 
has to store global domain information. This initialization 
can be executed independently of the actual simulation. The 
resulting block structure is then stored in a file to be loaded by 
the simulation at runtime. The framework is entirely written 
in C-I--I- with small portions of the code being automatically 
generated. For performance critical code sections, we make 
use of template meta programming techniques to achieve 
good performance without losing flexibility and usability. 

3.2 Input/Output 

A big challenge in massively parallel programs is dealing 
with the huge amount of output data. In many big simula¬ 
tions, I/O operations can become a significant bottleneck. 
Therefore, we try to minimize the amount of data that is 
read and written to the file system. Since the initial domain 
setup is generated using a Voronoi approach, we do not have 
to load big, voxel-based input files describing the domain. 
I/O is only necessary for checkpointing and for result output. 
For generating checkpoints, the complete simulation state 
has to be stored on disk, containing four 0 values and two 

values per cell. While all computations are carried out in 
























double precision, checkpoints use only single precision to save 
disk space and I/O bandwidth. Writing a checkpoint can 
take a significant amount of time compared to a simulation 
time step, therefore checkpoints are written infrequently. 

Simulation results have to be written more often, thus a 
faster technique is required for this task. Instead of writing 
all values of a cell, we only store the position of the inter¬ 
faces using a triangle surface mesh. Therefore, a custom 
marching cubes algorithm based on was developed that 
generates meshes locally on each block, using the (f) values 
as input. For each phase, a mesh is generated, describing 
the interface between this phase and all other phases. The 
marching cubes algorithm extends to the ghost regions such 
that the local meshes can be stitched together to a single 
mesh describing the complete domain. These local meshes 
could be written to disk, leaving the stitching for postpro¬ 
cessing. Since the marching cubes algorithm creates triangles 
with edge lengths in the order of dx, these meshes are still 
unnecessarily fine and could be adaptively coarsened without 
losing much accuracy. This coarsening step can optionally 
be performed before writing the mesh to keep output file 
sizes small. For mesh coarsening, we use the quadric-error 
edge-collapse-based simplification algorithm [12| from the 
Visualization and Computer Graphics Library (VCG) [^. 
The coarsening is done in a hierarchical way: In a first step, 
each process calls the edge-collapse algorithm on its local 
mesh. By assigning a high weight to all vertices that are 
located on block boundaries, the boundaries are preserved 
such that the later stitching step can work correctly. Then, 
two local meshes are gathered on a process, stitched together, 
and again coarsened in the stitched region. This step is 
repeated logj(processes) times where in each step only half 
of the processes take part in the reduction. This procedure is 
stopped if the mesh is either fully reduced or has reached a 
size that cannot be stored in the memory of a single node. In 
the latter case, postprocessing can be resumed on a machine 
with more memory than a compute node. 

3.3 Optimizations 

Starting with a very general phase-field model, the goal 
of this work was to develop a highly optimized code for the 
specific model, presented in [^, which efficiently can simulate 
sufficiently large 3D domains. Optimizations are done on 
different levels by exploiting physical, mathematical, and 
computational facts to decrease the computation time. 

For directional solidification, we can reduce the effective 
domain size in the solidification direction {Nz) by using 
a moving window technique [33| as shown in Figure]^ The 
evolution in the solid is multiple magnitudes lower than in 
the liquid such that we neglect the evolution in the solid. In 
order to produce physically correct microstructures, the base 
size of the domain Nx x Ny requires a minimal size to reduce 
the influence of the periodic boundaries. 

The evolution of the temperature can be described by 
a frozen temperature ansatz in solidification direction, by 
exploiting the time scales of the different evolution equations 
|16[ [ 7 ]. At a given time t, we assume the temperature being 
constant in slices orthogonal to the solidification direction. 

The grand potentials for the driving force are only needed 
in the range of the ternary eutectic point, therefore we use 
fitted parabolic Gibbs energies to derive the potentials instead 
of describing the total ternary system. This simplifies the 
calculations involving the concentrations c and the chemical 


potentials fi [^. 

The model equations can be simplified for certain parts of 
the domain for the two evolution equations. The computation 
of the anti-trapping current in the /r-evolution is only required 
in the interface region Fn of the solidification front. Since 
evaluating this current is computationally expensive, we 
detect as early as possible if the complete term has to be 
evaluated by testing critical subexpressions for zeros. When 
computing Jat, we first check if cf> is zero since then and 

thus Jat are zero. By introducing one additional check, we 
can omit the calculation of the expensive Jat in all cells which 
do not contain liquid. A similar check can be added for V4>i'- 
for cells with zero liquid phase gradient, the computation 
of the anti-trapping current can be skipped as well. The 
interface region 7n is bounded due to a sinus-shaped interface 
profile in which t 0. Following the calculation of the 
(j> evolution is only required in this small interface region. 
Adding these kind of checks introduces possibly expensive if 
conditions in the kernel, so there may be a trade-off between 
the saved computations and the peak performance. As shown 
in section the introduced checks reduce the total runtime 
considerably. 

We conducted further optimizations on the source code 
and hardware level using processor extensions available on 
the target machines. On these levels, we used a systematic, 
performance model driven approach to identify and eliminate 
relevant bottlenecks. As we show in section our code is 
bound by in-core execution, therefore the goal is to reduce 
the total number of floating point operations, potentially at 
the cost of increased memory traffic. 

The fact that in our specific scenario the temperature is a 
function of the z coordinate led us to choose the 2 iteration 
as the outermost loop in computation kernels, followed by 
loops over y and x. Since many values, which are required 
to calculate the driving force ip of or the anti-trapping 
current Jat of (^, depend on analytic temperatures only, 
these values can be pre-calculated once for each 2 -slice instead 
of computing them again in every cell. 

Following the strategy of saving floating point operations 
at the cost of memory transfers, we identify expressions in 
the model equations that are evaluated multiple times and 
can therefore be buffered. In our discretization scheme we 
also evaluate quantities at staggered grid positions. To avoid 
evaluation in each cell, they are written to a buffer after 
computation and are reused. Since the outermost loop in the 
computation kernels iterates over the 2 coordinate, a buffer 
of the size Nx x Ny is needed. 

The computationally most intensive part of equation ® 
is the calculation of the divergence of Ubuf := (MS/fj, - Jat), 
even if the evaluation of the anti-trapping current can be 
skipped for many cells. In order to update one cell, the six 
values of Ubuf at neighboring staggered positions are required. 
However, three of them can be buffered and reused since they 
have already been calculated during the update of previous 
cells (Figure]^. Only in cells located at block boundaries, 
all six values have to be computed explicitly. Using this 
buffering approach, we effectively halve the required floating 
point operations at the cost of additional memory accesses 
for reading and writing the buffered values. 

The same technique of buffering staggered values is also 
applied for evaluating the divergence of da{(p, <p) jdV(pa in 
the phase-field update step. 

While adhering to the established data parallel approach 
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Figure 4: Data dependencies (with communication 
overlap) 


Figure 3: Buffering values at staggered positions 
(D3C7). Circles mark cell midpoints, lines indicate 
cell boundaries. 


on inter-process level, the waLBerla framework offers task 
parallel programming models on intra-process level to sup¬ 
port overlapping communication with computation. The 
computation kernels as well as the ghost layer exchange rou¬ 
tines are implemented as C-I--I- functors, which are registered 
at a “Timeloop” class to manage the communication hiding. 

The communication of the chemical potential field can 
be hidden in a straightforward way since the following up¬ 
date of the phase-field only depends on local /r values (Fig¬ 
ure . This allows us to communicate the /r boundaries 
while updating the (/)-field. In our implementation the order 
of communication and boundary handling routines can also 
be interchanged without altering the results. 

Since updating ^ accesses neighborhood values of (j) for the 
current and new time step, the communication of the phase- 
held values can not simply be overlapped with /r computation. 
In order to hide the phase-held communication as well, the 
fj, update step has to be split up into two kernels. The obvi¬ 
ous method would be to update inner values hrst and after 
communication has hnished, update the values at the border. 
Implementing this solution would require a staggered value 
buffer of the same size as the complete held or recomputation 
of staggered values. Thus, we split up the fi update into a 
local and a non-local <j> dependency as indicated in it . This 
splitting simplihes the data dependencies (Figure]^ and does 
not affect the buffering of staggered values, but still has some 
overhead, since now the temperature dependent values have 
to be computed twice for each ^-slice. Therefore, we calculate 
the /i evolution without the anti-trapping current @ hrst. 
After transfer of the ghost layers, the anti-trapping current 
is calculated and added to the jj, evolution. Algorithm |3.3| 
shows the resulting time step of the phase-held method using 
fully overlapping communication. 


Algorithm 2 Timestep with communication overlap 
1: communicate fj,src 
2: <(>dst (j)-sweep(^(j) src /^src^ 

3: end communicate 

4: <j)^st boundary handling 

5: communicate <()dst 

6. /Tdst ^ /.^-SWeep-loCal^^s^-c, <(*src; ^dst^ 

7: end communicate 

8 : pdst /r-sweep-neighbor(/isrc,<(>src,<(>dst) 

9: pdst boundary handling 
10: Swap (j)BTc ^ 4>dst and Hbtc ^ fJ-dst 


In order to fully utilize modern hardware architectures, 
especially when running compute intensive codes, it is im¬ 
portant to make full use of their SIMD (single instruction, 
multiple data) capabilities. While most compilers can do 
automatic loop vectorization, it is still necessary to provide 
additional information to the compiler regarding aliasing, 
data alignment, and typical loop sizes using non portable 
pragma directives. An analysis with the performance mea¬ 
surement tool LIKWID 32 , which can measure the amount 
of vectorized and non-vectorized boating point operations, 
showed that large portions of our code were not automati¬ 
cally vectorized. While the pragma approach proves useful 
for small concise loop kernels, we could not improve the more 
complex compute kernels for (j) and fi of several hundred lines 
of code using this approach. Instead, we explicitly vectorize 
these two main compute kernels with intrinsic compiler func¬ 
tions. Using intrinsics requires a manual reformulation of the 
algorithm in terms of vector data types. Since these intrinsic 
functions directly map to assembler instructions, they are not 
portable across different architectures. In order to keep the 
code portable, a lightweight abstraction layer was developed 
that provides a common API supporting the Intel processor 
extensions SSE2, SSE4, AVX, AVX2 and the QPX extension 
on Blue Gene/Q cores. Not all functions of this API directly 
map to a single intrinsic function and therefore to a single 
assembler instruction for each instruction set. For example, 
AVX2 provides instructions for permuting elements of vector 
data types, which require two or more instructions in the 
SSE extension. Our API hides these differences, providing 
functions for all AVX2 and QPX intrinsics which are emu¬ 
lated on older extensions in the most efhcient way. Manual 
inspection of the assembler code shows that calls to this 
thin SIMD API are inlined by the compiler and therefore no 
additional overhead is introduced. 


4. HPC SYSTEMS 

In this section we shortly describe the three Tier-0/1 
cluster systems at the High Performance Computing Cen¬ 
ter Stuttgart (HLRS), the Leibniz Supercomputing Center 
(LRZ), and the Jtilich Supercomputing Centre (JSC). All 
systems support vectorization, either using AVX(2) on the 
Intel based chips or QPX on BlueCene/Q cores. 

4.1 SuperMUC 

The SuperMUC system located at Leibnitz Supercomput¬ 
ing Center in Munich is built out of 18,432 Intel Xeon E5-2680 
processors running at 2.7 GHz resulting in a total number of 
147,456 cores [^. One compute node of SuperMUC consists 
of 2 sockets, each equipped with 8 cores. 512 nodes are 
divided into one island. Within each island, SuperMUC uses 
a non-blocking tree network topology, whereas all 18 islands 
are connected via a pruned tree (4:1) On SuperMUC 












we used the Intel compiler in version 14.0.3 together with 
IBM MPI. We compiled our code on optimization level 3 
together with optimization flags enabling interprocedural 
optimization at link time and fast floating point operations. 

4.2 Cray XC40 (Hornet) 

The Cray XC40 system at the HLRS consists of 3944 
nodes with two sockets, each containing a Intel Haswell E5- 
2680v3 with 12 cores. The systems contains 94,656 cores in 
total which are interconnected by a Dragonfly network [18| 
called Cray Aries. The Haswell core clock rate is 2.5 GHz 
and it supports Hyper-Threading as well as the newer AVX2 
instruction set. On Hornet we have chosen the same compiler 
and compiler options as on SuperMUC with additional flags 
to make use of the AVX2 extension. 

4.3 JUQUEEN 

JUQUEEN is, as of April 2015, the fastest supercomput¬ 
ing system in Germany and is positioned on rank 8 on the 
TOP500 list [2]. It is a 28 rack Blue Gene/Q system pro¬ 
viding 458,752 PowerPC A2 processor cores, with each core 
capable of 4-way multithreading [14| . JUQUEEN uses a 
5-dimensional torus network topology capable of achieving 
up to 40 GB/s, with latencies in the range of a few hundred 
nanoseconds. We compiled our code with the IBM XL com¬ 
piler in version 12.1 with the compiler optimization level set 
to 5. 

5. RESULTS AND DISCUSSION 

In this section, we present single core performance results 
as well as scaling experiments run on these three super¬ 
computing systems. The presented performance results are 
measured in MLUP/s, which stands for “million lattice cell 
updates per second”. At the end of this section, we show 
simulation results of the Ag-AI-Cu system and compare them 
to experimental data. 

5.1 Performance Results 

Since the performance of the compute kernels depends 
on the composition of the simulation domain as shown in 
section [33] we executed our benchmarks separately for three 
representative parts of the domain. These different scenarios 
are labeled interface, solid, and liquid and correspond to the 
regions Fn, Ln, and Sn defined in Sec.[^ The solid scenario 
consists purely of already solidified material, as is the case in 
a realistic simulation in the lower third of the domain. The 
interface scenario simulates only the solidification front, i.e. 
the middle third of the simulation domain. The upper part 
of the domain consists only of liquid phase and is represented 
by the liquid scenario. 

5.1.1 Single Core Performance 

The starting point of our HPC implementation was a 
general phase-field code written in C. One main design goal of 
this application code is flexibility to allow rapid prototyping 
and testing of new models. A wide range of models is already 
implemented in this code: various phase-field models, a 
structural mechanics as well as a fluid mechanics solver, which 
can all be coupled with each other. As already mentioned 
above, for the specific model described here, simulations 
of large three dimensional domains are necessary to obtain 
physically meaningful results. This motivates the design and 


implementation of a new code that is highly optimized and 
parallelized to exploit the largest HPC systems available. 

A straightforward re-implementation of the grand chemical 
potential based phase-field model in the waLBerla frame¬ 
work already yields significant performance improvements 
(Figure]^. Since this new implementation is targeted at one 
specihc model, certain basic optimizations can already be 
applied easily to the compute kernels: While the original 
implementation makes heavy use of indirect function calls 
via function pointers at cell level, we either remove these 
indirections in the waLBerla version entirely due to the 
specialization of the code to one specific model, or replace 
them with static polymorphism using C-|—I- templates. Ad¬ 
ditionally, we replaced divisions where the denominator is 
guaranteed to be in a small set of values by table lookup 
and multiplication with the inverse. The number of division 
operations is further reduced by replacing inverse square 
root calculations, required for vector normalizations, with 
approximated values provided by a fast inverse square root 
algorithm [20| . 

In a second step we implement more advanced, aggressive 
optimizations. These include explicit SIMD vectorization in 
both computation kernels. Since the targeted architectures 
all have a vector width of four double precision values, the 
straightforward way for vectorizing the algorithm is to unroll 
the innermost loop, updating four cells in one iteration. 
While this technique is the only possible one for the /r-kernel, 
a more natural approach exists for the (/)-kernel of our specihc 
model: Instead of handling four cells simultaneously, we 
can use a SIMD vector to represent the four phases of a 
cell. With this technique, the Held is still updated cellwise, 
such that branching on a cell-by-cell basis becomes possible. 
This branching can signihcantly speed up the kernels, since 
some expensive terms have to computed only for certain 
cell conhgurations. Vectorized kernels that handle four cells 
simultaneously can only take these shortcuts if the condition 
is true for all four cells. Since the single cell kernel version 
operates on less data per iteration compared to the four 
cell version, more intermediate values can be kept in vector 
registers, which would have to be spilled to memory otherwise. 
A drawback of the cellwise version is the need for various 
permute or rotate operations when computing terms that 
contain single components of the 4> vector. This happens 
for example when expressions like 4>a 4>P have to be 

evaluated. 

Figure[^shows the performance of three vectorized 0-kernel 
implementations: The Hrst two implementations both use the 
cellwise vectorization approach. While the Hrst version eval¬ 
uates all terms for each cell, the second version contains tests 
that determine which terms have to be evaluated [cellwise 
with shortcuts ). The third version uses four-cell-vectorization 
and can only skip terms that are not required for all of the 
four cells. In all three parts of the domain, the single cell 
kernel with shortcuts performes best. The additional costs 
of the vector permute operations are alleviated by the ability 
to branch on a per cell basis and the reduced number of 
required SIMD registers. For all further benchmarks, we use 
this fastest cellwise kernel with shortcuts to update the phase 
Held. Since we apply two different vectorization strategies 
to the two compute kernels, the optimal data layout of the 
0-Held for the two kernels is different: The ^-kernel processes 
the data four cells at a time, so a structure-of-arrays (SoA) 
layout would be the best choice, while the fastest 0-kernel 



i interface JHUmillll^millllimiH 

vmsBBBsasm 

0 2 4 6 8 10 12 14 

MLUP/s for 0-kernel only 

Figure 5: Comparison of different vectorization 
strategies on one SuperMUC core, block size cho¬ 
sen as 60®. 

requires an array-of-structures (AoS) layout to be able to 
load a SIMD vector directly from contiguous memory. We 
choose a SoA layout since the /r-kernel has to load phase 
field values of 38 cells (19 cells from and 19 from 

(j}{x,t + At)) whereas the 0-kernel only has to load 7 cells. 
Due to the high computational intensity of the 0-kernel, no 
notable differences could be measured in the 0-kernel per¬ 
formance after a data layout change of the 0-field. Figure 
shows the increase in performance of the new SIMD kernels 
compared to the basic implementation. By applying vector¬ 
ization only, a maximal speedup of factor 4 can be expected. 
The high speedup of factor 5 to 7 is due to the fact that in 
addition to vectorization also additional optimizations like 
common subexpression precomputation were done at this 
stage. This comes at the cost of decreased flexibility: While 
in the basic implementation the code was structured accord¬ 
ing to the mathematical formulation of the model, the single 
terms of the model can hardly be recognized in the manually 
vectorized SIMD kernels. To decrease the maintenance effort 
for the various kernels, a regularly running test suite checks 
all kernel versions for equivalence. 

Figure [^additionally shows the performance improvement 
due to optimizations we can make when the temperature is 
prescribed as a function of time and one space component z 
only. In this case, we precompute all temperature dependent 
terms once for each x-y-slice instead of computing them in 
each cell. Using this optimization increases the performance 
of the /i-kernel by approximately 20% and the performance 
of the 0-kernel by 80%. 

While this technique improved predominantly the runtime 
of the 0-kernel, the staggered buffer optimization is targeted 
at the /r-kernel and the costly computation of the term 
V■ (ATV/r - Jat) which is evaluated at staggered positions in 
the grid. Buffering and reusing half of these values increases 
the /i-kernel performance by almost a factor of two, which 
shows that the runtime of this kernel is dominated by the 
calculation of these staggered values. In the 0-kernel, the 
same technique is applied for buffering gradient values at 
staggered positions. In this case, the buffered values are not 
as expensive to compute as the buffered quantities in the 
/r-kernel, therefore this optimization leads only to slightly 
better performance for this kernel. 


Up to now, all performance improvements affected all cells 
of the domain. While the kernel runtime for updating fz 
is, up to measurement error, equal in the complete domain, 
the 0-kernel runtimes vary slightly due to branches in a 
routine that projects the 0 values back into the allowed 
simplex. The following optimization introduces additional 
branches by skipping the evaluation of terms depending on 
the configuration of the current cell. This implementation 
is labeled as version “with shortcuts” in Figure [^ and was 
already included in the comparison of different vectorization 
strategies, shown above. These branches lead to different 
runtimes of the kernels in different parts of the domain. The 
performance of the 0-kernel is increased predominantly in 
liquid parts of the domain since there the computation of the 
coupling term 0 can be skipped entirely. The runtime of the 
/r-kernel is improved especially in solid cells due to a simpler 
calculation of the anti-trapping current in these cells. 

All optimizations combined result in a total speedup of 
up to 80 depending on the architecture. This huge speedup 
compared to the original C implementation enables us to 
simulate sufficiently large domains in reasonable time to get 
physically meaningful results. 

Since the relative speedup does not indicate to what extent 
the target architecture is utilized, we additionally investi¬ 
gated the absolute performance of our code. In the following, 
we focus on the singlenode performance of our optimized 
code, without the “shortcut” optimizations, since in this case 
the total number of executed floating point operations per 
cell can be determined exactly. 

We show the performance analysis for the /r-kernel on a 
SuperMUC node in detail, and shortly present the results for 
the 0-kernel afterwards. First, we use a rooflme performance 
model to determine if the code is memory or compute bound 
[34| . We measure the maximum attainable bandwidth using 
STREAM [22| on one node, resulting in a bandwidth of 
approximately 80 GlB/s. We assume that approximately 
half of the required data for one update can be held in cache, 
since in a symmetric D3C7 or D3C19 stencil half of the data 
can be reused if one x-y-slice of both fields completely fits 
into cache. For a typical block size of 40®, one x-y-slice of 
all four fields fits into L2 cache. Under this assumption, 
half of the required values are held in L2 cache and at most 
680 Bytes have to be loaded from main memory to update 
one cell. For one cell update, 1384 floating point operations 
are required, resulting in a lower bound for the arithmetic 
intensity of approximately two floating point operations per 
byte. Taking only the memory bottleneck into account our 
code could achieve up to 126.3 MLUP/s on one SuperMUC 
node: 

80— : 680-^ = 126.3 MLUP/s 
s IjUi 

The benchmark results presented in Figure show that our 
code does not hit the bandwidth limit of 126.3 MLUP/s and 
therefore is not memory bound. Choosing a smaller blocksize 
of 20®, which fits entirely into L3 cache, only changes the 
performance slightly. This is a further indication that the 
code is limited by in-core execution instead of memory band¬ 
width. To determine an upper bound for in-core execution 
we compare the attained FLOP/s rate with the maximal 
possible rate. One SuperMUC core runs with a frequency of 
2.7 CHz [^ and can perform 8 floating point operations per 
cycle, resulting in 21.6 CFLOP/s per core. The achieved 4.2 
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shortcut optimization on one SuperMUC node 


MLUP/s per core correspond to 5.8 GFLOP/s and therefore 
to 27% peak performance of one core. The in-core execution 
time is therefore not limited by the total number of floating 
point operations. To understand why we cannot utilize the 
full computational capability of one core, we employ the Intel 
Architecture Code Analyzer (lACA) [^. This tool inspects 
the assembly code of the kernel and statically predicts the 
estimated execution time on a given Intel architecture under 
the assumption that all data resides in LID Cache. The 
lACA analysis shows that even though the code is fully vec¬ 
torized, it can attain at most 43% peak under ideal front-end, 
out-of-order engine, and memory hierarchy conditions. This 
is caused predominantly by imbalance in the number of ad¬ 
ditions and multiplication as well as latencies for division 
operations. The lACA report shows, that further, highly ar¬ 
chitecture dependent optimizations are possible, for example 
manually reordering addition and multiplication instructions. 
These kind of optimizations would however require signifi¬ 
cant development effort due to the high complexity of the 
kernel of several thousand lines of code. A similar analysis 
for the (/-kernel shows that this kernel is also compute bound 
and attains approximately 21% peak performance on one 
SuperMUC core. 

5.1.2 Scaling Results 


Before conducting scaling experiments we evaluate the 
effect of the presented communication hiding schemes. All 
four possible combinations of hiding the communication of 
the (j>- and /r-field were tested. Figurej^shows the time spent 
in both communication routines. The amount of exchanged 
data is higher in the 0-communication, thus the overall com- 
munciation times are higher in this case. As expected, the 
effective communication times decrease for both fields when 
communication hiding is enabled. The remaining time in 
the communication routines is spent for packing and un¬ 
packing messages which cannot be overlapped. Overlapping 
0 -communication introduces additional overhead since the 
/i-kernel has to be split up into two parts. As a consequence, 
the temperature dependent value computation, which is done 
once for each x-y-slice in the no-overlap case, has to be done 
twice when 0 communication overlap is enabled. This over¬ 
head is much bigger than the benefit of communication hiding, 
thus the version with only /i communication hiding yields 
the best overall performance. 

In order to test the scalability of our implementation we 
execute weak scaling experiments on three of the largest 
German supercomputers. Weak scaling scenarios, where the 
domain size and the process count are increased by the same 
factor, are of practical relevance for this code, since the goal 
is to maximize the domain size while keeping the time to 



















solution constant. On SuperMUC, we scale three different 
scenarios up to 32,768 cores which were availabe to us. The 
full machine could not be utilized at the time of writing due 
to installation of a hardware npgrade. Figure]^ shows that 
because of the “shortcut” optimization, we obtain slightly 
different runtimes for the three scenarios, the “interface” sce¬ 
nario being the slowest due to higher workload in interface 
cells. In production runs, where all of the three block compo¬ 
sitions occur in the domain, the runtime is dominated by the 
interface blocks. We experimented with various load balanc¬ 
ing techniques offered by the waLBerla framework, which 
did, however, not decrease the total runtime significantly, 
because the moving window technique makes it possible to 
simulate only the interface region, such that, in production 
runs, most blocks have a composition similar to the “inter¬ 
face” benchmark. For the scaling experiments on Hornet and 
JUQUEEN, we only used the “interface” scenario, which is 
representative for the performance achieved in production 
runs. 

On SuperMUC and Hornet we placed one MPI process on 
each physical core. Simultaneous multithreading (SMT) did 
not improve the overall performance on these machines. On 
JUQUEEN however, 4-way SMT was employed to make full 
use of the in-order processing units. On this Blue Gene/Q 
machine we scaled up to 262,144 cores using 1,048,576 MPI 
processes. 


5.2 Simulation Results 

Due to the highly optimized code, we are able to study 
the evolution of microstructures in representative volume ele¬ 
ments (RVE), which are required for a better understanding 
of the various pattern formations. The performed simulations 
allow us to study the evolution of microstructures, especially 
in three dimensions. 

Common experimental analyzing methods, like micro¬ 
graphs, only provide two dimensional information. These 
micrographs can be used for a visual validation, but do not 
provide information on the three dimensional structure. In 
comparison to experimental micrographs, we can show a 
good visual accordance with cross sections of our three di¬ 
mensional simulations. A simulation with 2420 x 2420 x 1474 
cells, simulated on the Hornet supercomputer, is depicted in 


Figure 10(a) In the experiment as well as the simulation, 
the phases arrange in similar patterns as chained brick-like 
structures that are connected or form ring-like structures, as 
shown in Figure]^ Obtaining three dimensional information 
from experiments requires a large technical effort, using syn¬ 
chrotron tomography to resolve the different phases. A three 
dimensional reconstructed tomography result of directional 
solidification of the system Ag-AI-Cu from A. Dennstedt is 
depicted in Figure [T0(b)[ Even with these methods, only the 
final sample can be reconstructed, whereas the time evolu¬ 
tion cannot be observed. Using large-scale simulations, we 
found multiple microstructure characteristics, which could 
not be observed up to now due to small domain sizes. Three 
dimensional simulations are crucial to capture all relevant 
physical phenomena: In two dimensional cross sections paral¬ 
lel to the growth front, the phases simple arrange in different 
lamellae as brick-like structures while in three dimensions, 
various splits and merges of these lamellae can be observed. 
Single lamellae of the simulation in Figure are exempted 
in Figure |11| In Figure ll(a)[ two connected lamellae of 
the phase AhCu and in Figure [il(b)| two lamellae of the 


phase Ag 2 Al are shown. The evolution of the microstructure, 
especially the splitting of lamellae and merging, is visible, 
and allows us to study the stability of different phase ar¬ 
rangements. These phase arrangements influence the shape 
of the occurring phases and are of special importance to 
technical applications since the microstructure strongly influ¬ 
ences the mechanical properties, e.g. the elastic and plastic 
deformation behavior. Besides the good visual accordance 
of the large-scale simulations and two dimensional micro¬ 
graphs, also good agreement with the experimental 3D data 
is achieved. A more detailed comparison is shown in [16] 
and a quantitative comparison using Principal Component 
Analysis on two-point correlation is in preparation. In this 
publication, also the quantitative necessity of these large do¬ 
main sizes will be presented. In addition to the good visual 
agreement with the experiments, the simulations allow us to 
conduct parameter variations under well-dehned conditions, 
which provide excellent ways to study the underlying physical 
mechanisms of the resulting microstructure formation. 

6. CONCLUSION 

In this work, we have developed and evaluated a massively 
parallel simulation code for a thermodynamic consistent 
phase-held model, based on the grand potential approach, 
which runs efhciently on current supercomputing architec¬ 
tures. With this code, we simulated the directional solidih- 
cation of ternary eutectics, including four phases and three 
chemical species. Since big domain sizes are required to 
observe the formation of mircostructural patterns, a HPC 
implementation is crucial to get physical results for the consid¬ 
ered scenario. Starting from a general purpose and validated 
phase-held code of this complicated multiphase model for 
the solidihcation of alloys, we developed a new, highly op¬ 
timized implementation that can effectively utilize modern 
supercomputing architectures. We applied optimizations on 
various levels to the highly complex stencil code, ranging 
from scenario dependent model simplihcations to architec¬ 
ture specihc optimizations like explicit SIMD vectorization. 
Systematic node level performance engineering resulted in a 
factor 80 speedup compared to the original code, as well as 
25% of peak performance on node level. Additionally, com¬ 
munication hiding techniques were applied to optimize the 
performance on system level, too. We have shown excellent 
scaling results of our code on three of the largest German su¬ 
percomputers: SuperMUC, Hornet, and Juqueen. Only with 
this optimized implementation, it was possible to achieve the 
domain sizes necessary to gain microstructure patterns that 
are in good visual accordance with experimental data. 

For future work, we plan to switch from the explicit Eu¬ 
ler time stepping scheme to an implicit solver. With the 
presented, optimized model, further parameter studies will 
be conducted. Eurthermore, support for a wider range of 
architectures (Xeon Phi, GPU) is in preparation. 
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Figure 10: Three dimensional simulation and experimental results of directional solidification of the ternary 
eutectic system Ag-AI-Cu 
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Figure 11: Exempted lamellae from the simulation depicted in Figure |10[ The lamellae grew from left to 
right. 
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